{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![alt text](https://github.com/callysto/callysto-sample-notebooks/blob/master/notebooks/images/Callysto_Notebook-Banner_Top_06.06.18.jpg?raw=true)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Working With Open Data: Car Mileage Data Part 1\n",
    "\n",
    "\n",
    "\n",
    "As high quality open data is becoming available, having the skills to analyze that data is becoming more important. In this notebook series we will walk  through the basics of working with open data using a Jupyter notebook. We will also demonstrate some of the first steps you may take to perform data exploration and visualization. These are introductory notebooks for someone who may not have seen any code or a Jupyter notebook before. As such, we will walk through each step involved in detail and keep our analysis at a high level. Of course, that is not to say a high level analysis can't also lead to interesting conclusions or realizations.\n",
    "\n",
    "\n",
    "# Bringing Open Data Into Jupyter\n",
    "\n",
    "The easiest way to load open data into Jupyter is using a Python package called pandas. Pandas is a free and open source library for Python that allows you to manipulate and analyze almost any data  easily and quickly. It can load CSV files, spreadsheets, text files, and even tables directly from webpages. In essence, you can think of the pandas library as \"Excel on steroids\".\n",
    "\n",
    "For this first demonstration, we will walk you through the process of downloading, manipulating, and visualizing an open data set from an open data portal. We'll be using this website: [data.opendatasoft.com](https://data.opendatasoft.com). In particular we will be using a rather large data set of vehicle fuel economy available at [this link](https://data.opendatasoft.com/explore/dataset/us-vehicle-fuel-economy-data-1984-2017%40kapsarc/table/?disjunctive.make&disjunctive.model&sort=-year). If you travel to that page, you should see a data set like the picture below.\n",
    "\n",
    "> ![Screenshot of the data.opensoft.com](images/car_data_screen.png)\n",
    "\n",
    "From that screen if you were to click the large \"Export\" button it will take you to this page:\n",
    "\n",
    ">![Screenshot of the download screen](images/downloadscreen.png)\n",
    "\n",
    "If you click the link beside your desired file format, you could download that directly to your computer. As we're working on the hub, it might be tedious to download the file to our computer then re-upload it to the hub. Instead, we will get this notebook to download that data directly from the site. If you right click on the link besides the CSV file and click \"Copy Link Address\" you'll get the path to the actual data file. We will be using this link to download the car data into Jupyter. \n",
    "\n",
    "Before we can download that data we have a few book keeping items to take care of. First, we need to import the pandas library and set our environment up to place any graphs or visualizations we make directly into the notebook. This is done in the cell below. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This first line imports our pandas library, and gives it the name \"pd\" (so we can type less)\n",
    "import pandas as pd\n",
    "# we also import the plotting library\n",
    "import matplotlib.pyplot as plt \n",
    "# This tells Jupyter we want to place our graphs directly into the notebook. \n",
    "%matplotlib inline\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Getting the Data Into Jupyter\n",
    "\n",
    "Using that link we copied earlier, we can import that data directly into Jupyter. Let's first make our lives a little easier and assign our URL to a variable named `url`, which is done below. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Notice how we've placed the url between quotes!\n",
    "\n",
    "url = 'https://data.opendatasoft.com/explore/dataset/us-vehicle-fuel-economy-data-1984-2017@kapsarc/download/?format=csv&timezone=America/Denver&use_labels_for_header=true'"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now comes the matter of downloading that data set. Luckily pandas comes with a function called `read_csv`. Conveniently enough, this function reads a CSV file into what is called a _Dataframe_. For our purposes, a Dataframe can just be thought of as a table of data. \n",
    "\n",
    "There is one complication. A CSV file can be separated by either commas (\",\"), semi-colons (\";\"), or tabs (\"\\t\"). In our case, from the image we have above, we know that our data set is separated with semi-colons. By default pandas assumes data is separated by commas.  As such, we have to specify our alternative delimiter when downloading this data set. This is a rather large data set, and it could take a minute or two to download. You'll know when the download is complete when `In [*]` in the upper left corner outside of the cell below changes to `In [3]` (or a larger number if you've run multiple cells) \n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "''' \n",
    "Here we're actually downloading the data set which is at 'url', the download link we copied earlier. \n",
    "We've also specified our data delimeter with the 'sep' command, and we've stated that the character\n",
    "is a semi-colon. Finally, we're also assigning this data to a variable we've called car_data\n",
    "'''\n",
    "car_data = pd.read_csv(url, sep=';')\n",
    "\n",
    "# This writes our newly dowloaded file to a local CSV so we don't have to download it again\n",
    "# if we want to use this data later, the term in the quotes is the name of the file that \n",
    "# we will create while saving. \n",
    "# car_data.to_csv(\"car_data.csv\")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Exploring the Data\n",
    "\n",
    "Now that we have our data set downloaded, let's take a look at the dataset that we've downloaded "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "car_data\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you scroll to the bottom of the table window above you'll see the the number of columns and rows of our table.  This particular data set has approximately 40000 rows and 83 columns. That's quite a lot of data! This table also includes \"non-numeric\" data such as car and model names, which can occasionally cause problems. You may also have noticed several entries in the table of `NaN`.  `NaN`  means there was no data entered at that location in the table and can be thought of \"an empty cell\".  But not to worry! We'll show you how easily those missing data points can be dealt with. \n",
    "\n",
    "\n",
    "### Getting Column Names\n",
    "For any analysis with a data table, it is important to know the column names of the data set. Unfortunately our table is too wide to see all entries in the table above. Luckily for us, we have other ways of extracting this information:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'''\n",
    "With a dataframe, the Python function 'list' returns a list of the column headers\n",
    "in any given data frame. The column headers of our car_data data frame our printed\n",
    "below. \n",
    "'''\n",
    "\n",
    "# Note: If you want this to display as a single column, remove 'print' as well as its \n",
    "# parenthesis (the first and last ones) \n",
    "\n",
    "print(list(car_data))\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Where a list of what many of the less obvious column headers represent are available [at this link](https://data.opendatasoft.com/explore/dataset/us-vehicle-fuel-economy-data-1984-2017%40kapsarc/information/?disjunctive.make&disjunctive.model&sort=-year). \n",
    "\n",
    "### Understanding the Data\n",
    "\n",
    "Before we work our way through using this data frame to create a visualization, sometimes it's useful to try and understand the data you have available. In our case, it is of interest to see how many car models and manufacturers there are in the data set. First, let's see how many models (i.e. Honda Civics, Ford Explorer, Lamborghini Huracán, etc.) are in the data set. This is done below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'''\n",
    "Here we're calling the 'model' column only from within our data frame by using square \n",
    "brakets after our data frame name, and our column name in quotes. The len() function\n",
    "simply returns the length of a given list or column. \n",
    "The '.unique()' function returns only the unique models in the data set \n",
    "(if you remove .unique(), you'd get the length of the entire data set!)\n",
    "'''\n",
    "\n",
    "len(car_data['model'].unique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "That's a lot of car models! At this point in any analysis without a specific goal in mind, or specific models to compare, we may have a few too many manufacturers for a reasonable exploration. Let's try the same to see how many car manufacturers are in the data set. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "len(car_data['make'].unique())\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This is just for a \"quick view\" of what the car makers are. \n",
    "print(car_data['make'].sort_values().unique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "135 car manufacturers is a lot more manageable than nearly 4000 car models for a first stab at analysis, let's start to visualize some of this data.\n",
    "\n",
    "### Visualizations \n",
    "\n",
    "Let's focus on average MPG (miles per gallon) for each car in both the city and on the highway. There are under the `UCity` and `UHighway` columns in the data table. If we're interested in all the data, plotting these data is very easy using pandas and shown below. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'''\n",
    "Here we call the .plot() function from our ploting library on our data frame by typing\n",
    "'car_data.plot( ... ). The arguments inside the plot function are explained below\n",
    "\n",
    "1. x = 'UHighway'  : This specifies which data to use for the x component of each data point. In this case \n",
    "                     we're using 'UHighway', which is the fuel economy in MPG of each vehicle while driving\n",
    "                     on the highway\n",
    "                   \n",
    "2. y = 'UCity'     : This specifies which data to use for the y component of each data point. In this case\n",
    "                     we're using \"UCity\", which is the fuel economy in MPG for each vehicle while driving\n",
    "                     in the city\n",
    "                   \n",
    "3. kind = 'scatter': This specifies what kind of plot to create. In this case we're making a scatter plot, \n",
    "                     however you could also specify a line, bar, etc. plot here instead\n",
    "                     \n",
    "4. title = '...'   : This specifies the title to place on the top of the plot \n",
    "'''\n",
    "\n",
    "car_data.plot(x ='UHighway', \n",
    "              y = 'UCity', \n",
    "              kind = 'scatter', \n",
    "              title=\"All Car MPG\")\n",
    "                                                                                "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Where this plot doesn't tell us a lot, but to be fair we wouldn't expect it to! Suppose we were concerned with only one car manufacturer instead of _all_ the vehicles. For example, let's choose an every day car that anyone has access to: a Bentley. We can view only the Bentley data as follows"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'''\n",
    "Here we're creating a new data table consisting only of the rows which have the \n",
    "word \"Bentely\" in the 'make' column. This allows us to quickly filter the data down\n",
    "to only the data that is relevant to cars made by Bentely. \n",
    "\n",
    "Essentially what this line of code does \n",
    "     \n",
    "     car_data[car_data['make'] == 'Bentley']\n",
    "     \n",
    "Is create an entirely *new* data frame of only rows where 'make' column is identical to \"Bentley\".\n",
    "This will ignore any and all data that is not associated with the word \"Bentley\" in the \"make\" column\n",
    "\n",
    "'''\n",
    "# Note that there are two equals signs! \n",
    "\n",
    "Bentley_data = car_data[car_data['make'] == 'Bentley']\n",
    "\n",
    "Bentley_data.plot(x ='UHighway',\n",
    "                  y = 'UCity',\n",
    "                  kind = 'scatter', \n",
    "                  title=\"Bently MPG\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we're starting to get some data that we might find more interesting! However, suppose we'd like to connect these dots with lines, we would need to change the above code to something like this"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "Bentley_data.plot(x ='UHighway',\n",
    "                  y = 'UCity',\n",
    "                  kind = 'line', \n",
    "                  title=\"Bently MPG Line Graph\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Of course, that graph is nearly incomprehensible! That's a result of our frame plotting points as it sees them - our data is unsorted! Luckily for us, this is easily remedied by sorting our data before we plot, as shown below"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'''\n",
    "Here we're sorting *every row* of our data frame with respect to the specified column \"UHighway\" \n",
    "By using the sort_values function included with pandas\n",
    "'''\n",
    "\n",
    "Sorted_Bentley = Bentley_data.sort_values(\"UHighway\")\n",
    "Sorted_Bentley.plot(x ='UHighway',\n",
    "                  y = 'UCity',\n",
    "                  kind = 'line', \n",
    "                  title=\"Bently MPG Line Graph Sorted\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Formatting a Plot\n",
    "\n",
    "You may have noticed that all the plots we've been showing seem to lack a lot of \"pleasing\" formatting, and more important factors such as a labeled y-axes. The code block below specifies some of the basic settings that we can use to make our plots look a little more presentable. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'''\n",
    "Here we explain what each line does that you haven't seen before using a different comment structure\n",
    "to introduce you to how you may see comments in other resources/online. \n",
    "'''\n",
    "\n",
    "Sorted_Bentley.plot(x ='UHighway',\n",
    "                  y = 'UCity',\n",
    "                  kind = 'line', \n",
    "                  figsize = (12,8),                # Set the figure size to be 12 x 8 (inches)\n",
    "                  color = 'red'                    # Changes the color of the plot, could also use color codes\n",
    "                   )\n",
    "# 'plt' is taken from the ploting library we imported at the beginning of the notebook \n",
    "# and is also used by pandas. Doing it out here gives us more control \n",
    "\n",
    "plt.title(\"Bently MPG Sorted\", size = 20)          # Set title with font size\n",
    "plt.xlabel(\"Highway Miles Per Gallon\" , size = 16) # Add an x axes, 'size' controls the font size\n",
    "plt.ylabel(\"City Miles Per Gallon\", size = 16)     # Add a y axes label\n",
    "plt.xticks(size = 14)                              # Change font size of xticks\n",
    "plt.yticks(size = 14)                              # Change font size of yticks\n",
    "plt.grid('on')                                     # Adds a grid to the plot\n",
    "plt.autoscale(tight=True)                          # This removes the padding around the plot "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "## Before You Try It Yourself\n",
    "\n",
    "We'll soon be at a part of this notebook where we encourage you to manipulate the plots directly. To help you get started, here are a few common error messages you may see in the process, and how to fix them. \n",
    "\n",
    "### Common Error Messages\n",
    "\n",
    "### Type Error\n",
    "Every time you have an error in the cell below, it is usually the result of a simple typo! When this happens you'll typically get a large and intimidating error message that makes it seem a lot worse than it is. For example if you ran a code cell with this:\n",
    "```python\n",
    "car_data[car_data['make'] == 'Fake Car Maker'].sort_values('UHighway').plot(x = 'UHighway',\n",
    "                                                                            y = 'UCity',\n",
    "                                                                            kind = 'line', \n",
    "                                                                            title=\" MPG Line Graph Sorted\")\n",
    "```\n",
    "\n",
    "You would get the following error message which would appear to be quite intimidating!\n",
    "\n",
    "![error messages](images/error.png)\n",
    "\n",
    "Where all that output is the result of making one small typo! But don't let that worry you! The important information is the first line of the error message where it says `TypeError` and shows you the approximate location of where the error was encountered. The other piece of important information is at the bottom of the output, you see the actual error message\n",
    "\n",
    "```python\n",
    "TypeError: Empty 'DataFrame': no numeric data to plot\n",
    "```\n",
    "\n",
    "What this is telling you is that you're trying to plot an empty dataframe. Which at first glance, seems impossible, we know that we have data in `car_data`! Where did it go?\n",
    "\n",
    "The culprit here is when we're filtering down to only the rows relevant to our car manufacturer. As there is no manufacturer \"Fake Car Maker\" In our data set, we're not returning any data when we try to filter it down. To fix this, all we need to do is change the text 'Fake Car Maker' to a car maker which is present in the data set. \n",
    "\n",
    "### Name Error\n",
    "\n",
    "If you were to run a line of code like this \n",
    "```python\n",
    "car_data[car_data['make'] == Bentley].sort_values('UHighway').plot(x='UHighway',\n",
    "                                                                   y = 'UCity',\n",
    "                                                                   kind = 'line', \n",
    "                                                                   title=\" MPG Line Graph Sorted\")\n",
    "```\n",
    "you would get the following error message\n",
    "\n",
    "```python\n",
    "---------------------------------------------------------------------------\n",
    "NameError                                 Traceback (most recent call last)\n",
    "<ipython-input-181-0856eb5a442a> in <module>()\n",
    "----> 1 car_data[car_data['make'] == Bentley].sort_values('UHighway').plot(x='UHighway',\n",
    "      2                                                                    y = 'UCity',\n",
    "      3                                                                    kind = 'line',\n",
    "      4                                                                    title=\" MPG Line Graph Sorted\")\n",
    "\n",
    "NameError: name 'Bentley' is not defined\n",
    "```\n",
    "\n",
    "This is caused because we have not wrapped our car maker in quotes and Python is looking for a variable named Bentley. To fix this, simply change Bentley $\\rightarrow$ 'Bentley'\n",
    "\n",
    "### Key Error\n",
    "\n",
    "Should you accidentally try to plot a column that doesn't exist due to a typo, such as the one in the a snippet like this\n",
    "\n",
    "```python\n",
    "car_data[car_data['make'] == 'Bentley'].sort_values('UHighway').plot(x='UHighway',\n",
    "                                                                     y = 'THIS IS NOT A COLUMN IN THE DATA SET',\n",
    "                                                                     kind = 'line', \n",
    "                                                                     title=\" MPG Line Graph Sorted\")\n",
    "```\n",
    "\n",
    "You would get a very long and intimidating error message, the bottom of which would be this error message\n",
    "```python\n",
    "KeyError: 'THIS IS NOT A COLUMN IN THE DATA SET'\n",
    "```\n",
    "\n",
    "What this is telling you is that there is no column in your table with the label `'THIS IS NOT A COLUMN IN THE DATA SET'`. To fix this, simply choose a column that you have labeled, or fix any minor typos.\n",
    "\n",
    "Those are the only error messages you're likely to see in the snippet below (or really any pandas plot like this!). Now that you know what they mean and how to fix them, you should be all set to try out the code below with any columns/data in our table. \n",
    "\n",
    "## Try It Out\n",
    "\n",
    "In the cell below we've set up the Bentley test case again, however, feel free to change the which manufacturer you're exploring. You should also feel free to explore other data for the $x$ and $y$ axis. Don't worry if something fails! Hopefully the common error message solutions above will help you out"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "car_data[car_data['make'] == 'Bentley'].sort_values('UHighway').plot(x='UHighway',\n",
    "                                                                     y = 'UCity',\n",
    "                                                                     kind = 'line', \n",
    "                                                                     title=\" MPG Line Graph Sorted\",\n",
    "                                                                     figsize = (12,8))\n",
    "\n",
    "plt.title(\"Bently MPG Sorted\", size = 20)          # Set title here with font size\n",
    "plt.xlabel(\"Highway Miles Per Gallon\" , size = 16) # Add an x axes, 'size' controls the font size\n",
    "plt.ylabel(\"City Miles Per Gallon\", size = 16)     # Add a y axes label\n",
    "plt.xticks(size = 14)                              # Change font size of xticks\n",
    "plt.yticks(size = 14)                              # Change font size of yticks\n",
    "plt.grid('on')                                     # Adds a grid to the plot\n",
    "plt.autoscale(tight=True)                          # This removes the padding around the plot "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you don't feel like scrolling all the way up to find other car make names, simply uncomment the cell below to see the list of car makers. See instructions in the cell to understand what we mean by \"Uncomment\" "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "'''\n",
    "To 'uncomment the line of code below, you simply need to delete the pound sign (hash tags for those of \n",
    "you who frequent Twitter) and the trailing space afterwads. In Python any characters in a line after a\n",
    "pound sign are ignored by the program and callend a comment as you have likely seen in many of the code blocks\n",
    "above. They can be very handy. You'll also notice that this block of text is contained between three\n",
    "quotes (') on each side. This is known as a \"block quote\" and is also ignored by Python, but it is more \n",
    "useful for typing longer messages such as this. \n",
    "'''\n",
    "\n",
    "# print(car_data['make'].sort_values().unique())\n",
    "None"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Conclusion\n",
    "\n",
    "In this notebook we demonstrated downloading an open data set directly from the Internet into a Jupyter notebook and to explore the data. We went through some basic filtering of a data set using `pandas` as well as how to create plots easily and effectively from the data. While we did not draw any interesting conclusions in this analysis, this represents a first initial step in data exploration. In part two of this notebook, we will outline the basics of data aggregation to discover more exciting and interesting trends hiding in this data set. In part two, we will start to tease more interesting trends out of this data set."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "![alt text](https://github.com/callysto/callysto-sample-notebooks/blob/master/notebooks/images/Callysto_Notebook-Banners_Bottom_06.06.18.jpg?raw=true)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.7.0"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
